The macroecology of butyrate‐producing bacteria via metagenomic assessment of butyrate production capacity

Abstract Butyrate‐producing bacteria are found in many outdoor ecosystems and host organisms, including humans, and are vital to ecosystem functionality and human health. These bacteria ferment organic matter, producing the short‐chain fatty acid butyrate. However, the macroecological influences on their biogeographical distribution remain poorly resolved. Here we aimed to characterise their global distribution together with key explanatory climatic, geographical and physicochemical variables. We developed new normalised butyrate production capacity (BPC) indices derived from global metagenomic (n = 13,078) and Australia‐wide soil 16S rRNA (n = 1331) data, using Geographic Information System (GIS) and modelling techniques to detail their ecological and biogeographical associations. The highest median BPC scores were found in anoxic and fermentative environments, including the human (BPC = 2.99) and non‐human animal gut (BPC = 2.91), and in some plant–soil systems (BPC = 2.33). Within plant–soil systems, roots (BPC = 2.50) and rhizospheres (BPC = 2.34) had the highest median BPC scores. Among soil samples, geographical and climatic variables had the strongest overall effects on BPC scores (variable importance score range = 0.30–0.03), with human population density also making a notable contribution (variable importance score = 0.20). Higher BPC scores were in soils from seasonally productive sandy rangelands, temperate rural residential areas and sites with moderate‐to‐high soil iron concentrations. Abundances of butyrate‐producing bacteria in outdoor soils followed complex ecological patterns influenced by geography, climate, soil chemistry and hydrological fluctuations. These new macroecological insights further our understanding of the ecological patterns of outdoor butyrate‐producing bacteria, with implications for emerging microbially focused ecological and human health policies.


| INTRODUC TI ON
Butyrate-producing bacteria are both associated with host organisms and free-living in outdoor ecosystems.They have critical roles in breaking down organic products including fibres (Baxter et al., 2019) and cellulose (Goldfarb et al., 2011).Given suitable organic substrates and anaerobic conditions, these bacteria can produce butyrate, a shortchain fatty acid, as a metabolic by-product of fermentation.In soils, butyrate is associated with the suppression of soil-borne plant pathogens (Poret-Peterson et al., 2019).In humans and non-human animals, gut-associated butyrate provides energy to colonic epithelial cells, maintains gut barrier stability, has anti-inflammatory and immunomodulatory effects, and strengthens the integrity of the blood-brain barrier (Bedford & Gong, 2018;Brame et al., 2021;Knox et al., 2022;Rivière et al., 2016).As such, butyrate production by bacteria is now understood to be a vital part of host-microbe symbioses.
Efforts to quantify the taxonomic and functional abundances of butyrate-producing bacteria through metagenomic analyses have realised multiple challenges.For example, in the SEED functional genome annotation system (Overbeek et al., 2005; https:// pubse ed.these ed. org/ ), the subsystem 'Acetyl-CoA fermentation to butyrate' (https:// pubse ed.these ed.org/ ) includes 20 genes involved in the butyrate production pathway, and individual bacterial genome isolates show varying copies and synteny of the genes for the enzymes (Vital et al., 2014).
Quantifying complete functional pathway abundances within samples can therefore be computationally time-consuming and expensive, limiting large-scale sample comparisons.Alternately, molecular methods that directly measure butyrate can provide insights into the functional potential of bacterial communities in samples.However, to our knowledge, few studies have focused on butyrate in outdoor ecosystems (Brame et al., 2021;Liu et al., 2011), and no studies to date have matched soil bacterial short-chain fatty acids and metagenomics on a global biogeographical scale.Therefore, we aimed to develop a novel method to estimate a sample's potential for butyrate production in a computationally agile way.This workflow could then be scaled up to examine large numbers of samples from a broad spectrum of sources, including outdoor ecosystems.Soils represent a key reservoir for microbial diversity in ecosystems (Fierer & Jackson, 2006) and provide a focus for us to examine conditions that may support butyrate producers.Soil butyrate producers may be involved in soil-plant-organic matter cycling processes and be ingested and expelled from the digestive tract of soil fauna (e.g.earthworms; Thakuria et al., 2010).At the same time, non-human animal gut-associated bacteria are regularly dispersed into outdoor ecosystems, where they can settle into soils (Blum et al., 2019).Moreover, bacteria in the superficial layer of soils can be dispersed into the air (Polymenakou, 2012;Robinson et al., 2020) and transmitted to hosts (Liddicoat et al., 2020).However, almost no existing studies have quantified outdoor abundances of butyrateproducing bacteria.As such, examining the distribution of butyrateproducing bacteria among the ecological components of their transmission cycle could provide more detailed insight into epidemiological and ecological knowledge of these bacteria, including conditions and mechanisms that support persistence and transmission.Macroecological knowledge of butyrate-producing bacteria could allow urban landscape architects and managers to understand how certain environmental conditions contribute to the human and non-human exposome.Moreover, it could also allow researchers to develop ways of selecting for potentially health-promoting bacterial assemblages in the environment (Brame et al., 2021).
Here, we examined a broad variety of global sample sources and ecological conditions (e.g.climate, geography and soil physicochemical characteristics) to identify associations with butyrate production abundances.We developed two new normalised indices to estimate and compare the butyrate production capacity (BPC) of representative samples from both metagenomic (BPC meta ) and 16S rRNA amplicon (BPC 16S ) data.We combined global metagenomic datasets with continent-wide Australian 16S rRNA amplicon data to provide insight into the global macroecology of butyrate-producing bacteria.Specifically, we aimed to: (a) determine the butyrate production capacities across a range of hosts and ecological conditions using our novel BPC meta scores; (b) provide a more detailed comparison of BPC meta scores across subcategories within each broader source group; (c) conduct an in-depth analysis of Australian soil samples using BPC 16S scores to describe ecological associations with soil butyrate production capacity; and (d) examine the potential influence of human population density on abundances of soil butyrate producers.
By developing novel-targeted formulae and collating multiple types of data and analyses, we were able to synthesise a novel and in-depth view into the global macroecology of butyrate-producing bacteria and the interlinkages between their hosts and ecosystems.

| Method overview
To compare the bacterial butyrate production capacity of samples across a broad spectrum of hosts and ecosystems, we chose to analyse samples based on two types of bacterial sequencing data: shotgun metagenomic sequences and 16S rRNA amplicons.Each data type required the interrogation of separate databases, using separate statistical tools for analysis and visualisation.We developed novel formulae for each data type to estimate the butyrate production capacity (BPC) for each sample (detailed below).Each formula included a normalisation step to allow BPC scores between samples (within a data type) to be directly compared (Table 1).

| Gene selection and metagenome database interrogation
To characterise the global distribution of butyrate-producing bacteria, we analysed shotgun metagenomic datasets.To begin, the butanoate (butyrate) synthesis pathways were reviewed using Seed viewer subsystems (https:// pubse ed.these ed.org/ ) and the KEGG pathway (Kanehisa et al., 2016; https:// www.genome.jp/ kegg/ pathw ay.html) to determine the genes that code for enzymes that are part of the butyrate production pathway.Based on these pathways, the following two genes were chosen for further analysis: buk (butyrate kinase) and atoA (acetate-CoA:acetoacetyl-CoA transferase subunit beta).
ACADS/bcd (butyryl-CoA dehydrogenase) and ptb (phosphate butyryltransferase) were analysed but excluded (all gene decisions explained in Table S1).The included genes atoA and buk participate in each of the two terminal pathways.The Integrated Microbial Genomes & Microbiomes (IMG/M) database (Chen et al., 2021) was then searched for each butyrate-related gene to obtain their mean counts among genomes with at least one copy of either gene (Table S2).At the time of download, IMG/M utilised Annotation Pipeline v.5.0 protocols (release date October 2018) for functional annotations (Chen et al., 2019).
We next searched global metagenomic databases for atoA and buk to find metagenomes that suggest the potential presence of butyrate-producing bacteria.Initial gene and translated gene searches of metagenomics data via the 'Searching SRA' web facility (https:// www.searc hsra.org) using bowtie2 and diamond yielded low numbers of samples and/or high E-values.The largest datasets came from searching IMG/M using EC numbers for each butyrate-production enzyme (butyrate kinase = EC 2.7.2.7, acetate-CoA:acetoacetyl-CoA transferase subunit beta = EC 2.8.3.8) as well as three enzymes with single-copy genes used for subsequent normalisation (phenylalanine-rRNA ligase = EC 6.1.1.20;guanylate kinase = EC 2.7.4.8;.Sample datasets with the genes atoA (n = 21,147) and buk (n = 16,263) were downloaded as our starting point for metagenomics analysis.Among the downloaded datasets, 14,407 datasets had both genes, and many datasets had one gene but not the other (atoA only n = 6330 and buk only n = 1856).We then collated these datasets to create an initial set of 22,593 unique metagenomic samples (Table S3).
Counts for each butyrate-production gene were normalised by dividing by counts of the single-copy gene pheS, which codes for the protein phenylalanine-tRNA ligase alpha subunit and was used as a proxy for total genome count.Counts for two other single-copy genes (GUK1: guanylate kinase and alaS: alanine-tRNA ligase) were also inspected, but they were not used because GUK1 searches showed low counts, and alaS showed slightly different but proportional counts to pheS, which validated the usage of pheS to normalise estimates of total genomes in the samples.However, 115 samples did not include pheS count data and were subsequently removed from our analysis.Based on examination of data distribution (distribution shown in Figure S1), samples with a low pheS count <100 (n = 5479) and samples with a (buk + atoA)/pheS ratio >30% (n = 804) were removed from our analysis to minimise bias from samples with a low normalising pheS count.Outlier samples with pheS count >50,000 (n = 19) were also removed from our analysis.In addition, samples that did not fall within the scope of our research question (n = 3098), such as deep subsurface, contaminated (e.g.uranium-contaminated where SCG = single copy gene (pheS); Gene1 = buk, Gene 2 = atoA; CountGene1, CountSCG are from global metagenomics sample datasets; MeanGeneXCopies = mean count of copies of gene X among all genomes found from searches of gene X within the IMG/M genome database.
Once BPC meta scores were computed and added to the spreadsheet using Excel formulae, the samples were sorted into six categories: soil and terrestrial sediments, aquatic, human, non-human animal, plant and agro-industrial (Table S4).Samples were then grouped by subcategories for further analysis: human samples were grouped by body compartment; non-human animal samples were grouped by vertebrate/invertebrate and by phylum; plants were grouped by compartment; soil samples were grouped by anthropogenic biome classification (i.e.anthromes, which are 'terrestrial biomes based on global patterns of sustained, direct human interaction with ecosystems'; Ellis & Ramankutty, 2008); aquatic samples were grouped by source subcategory; and agro-industrial samples were grouped by source site.To reduce bias within the anthrome 'Class' categories, two studies whose samples accounted for >50% of the total class samples were removed from the analysis.
To identify whether our BPC meta formula was estimating butyrate production rather than general anaerobicity, we adapted it to estimate ethanol production, which also requires anaerobic conditions.
The butyrate synthesis genes were replaced with the terminal gene for alcohol dehydrogenase (ADH, EC 1.1.1.1)to derive an Ethanol Production Capacity (EPC) score.Comparing the resulting EPC scores of the soil metagenomic samples with their BPC meta scores showed a negligible correlation (Figure S2).This further validated that our BPC meta scores were specific to butyrate production.

| Statistics
Statistics were done in R (version 4.0.2;R Core Team, 2020).
The Shapiro-Wilk test was used to determine the normality of distribution.In each category the BPC meta scores did not fit a normal distribution, and either the non-parametric Kruskal-Wallis test or the Wilcoxon rank-sum test was used to test for between-group variation.Due to a high n in some subgroups, a post hoc Dunn test with Bonferroni correction was used to compare subgroup pairwise differences at α = .05.The R package 'ggplot2' (version 3.3.5; Wickham, 2016) was used for data visualisation.Mapping of samples using the pseudocylindric 'Robinson' projection in QGIS (v 3.2.2; QGIS Development Team, 2020) was performed on 2850 sample metadata coordinates after excluding 360 samples with coordinates with less than two decimal points and 153 samples with no coordinates.

| BPC scores for 16S rRNA amplicon samples
To assemble 16S rRNA gene abundance data in Australian soil samples, the Australian Microbiome Initiative (Bissett et al., 2016) database was queried for the following parameters: Amplicon = '27F519R', Kingdom = 'bacteria', Environment = 'is soil', Depth = 'between 1 and 10' (cm).We downloaded the abundances of sequences with 100% identity threshold (zero-radius operational taxonomic units or zOTU) and metadata for each resulting sample (n = 3023).We used the phyloseq package (McMurdie & Holmes, 2013) for managing and cleaning the 16S rRNA data.We removed all 'chloroplast' and 'mitochondria' data.We removed low abundance zOTUs that did not occur in at least two samples and had total counts of <20, as these may have arisen from processing errors.In addition, we kept only samples with total number of sequences between 30,000 and 500,000 to remove outliers and samples with low sequence depth.The resulting sample count was n = 2795.To normalise the data and reduce bias in comparing samples, we transformed the zOTU abundances into relative abundances.
Because 16S rRNA data can have a relatively poor resolution at the species level (Jovel et al., 2016) as well as many unclassified genera, we focused our BPC 16S derivation on family-level data.
Using the Genome Taxonomy Database (GTDB) website interface and the set of putative butyrate-producing species (n = 118) from Vital et al. (2014), we collated the families that included members in this list of butyrate-producers (n = 54, Table S5).
where family1 = Acetonemaceae, family 2 = Acidaminococcaceae, … (see Table S5 for a list of all 54 families).Count zOTUs in each butyrateproducing family are from Australian Microbiome Initiative datasets.# butyrate-producing species (and total binomial species) in each family were evaluated from the entire GTDB.

| Modelling ecological associations of BPC 16S scores
To provide further macroecological context to butyrate-producing bacteria in soils, BPC 16S scores were associated with geographically paired ecological metadata.We chose 16S rRNA amplicon-based studies for this analysis because many soil studies in Australia have utilised 16S rRNA data, and the Australian Microbiome Initiative facilitated access to a continental coverage of data collected via consistent sampling and bioinformatic protocols.By selecting a substantial yet manageable spatial scale (i.e.continental Australia), we efficiently examined associations of a larger pool of ecological characteristics with BPC 16S scores.We also used ecological metadata from sources focusing solely on Australia (e.g.Atlas of Living Australia), which differs from the metadata sources utilised in our global analyses (e.g.anthropogenic biomes).
To associate ecological data with BPC scores, we first removed Geoscience Australia (silica index; Cudahy et al., 2012).We used the best available resolution of source data as supplied to avoid introducing additional noise or bias into our analyses.For example, certain analytical test results were available from sample metadata corresponding to 16S rRNA amplicon data, while other ecological covariate data were extracted from gridded spatial ecological layers at points corresponding to the site locations.
Analysis of the predictor variables showed multiple instances of collinearity (e.g.r > .80 or Variable Inflation Factor scores >15), and some scatterplots generated showed a curvilinear relationship with BPC 16S scores (scatterplots shown in Figure S3).Therefore, we used two approaches (detailed below) that were less influenced by collinearity for subsequent analyses: principal components analysis into k-means clustering and decision tree modelling via Random Forest (Breiman, 2001).Because many samples did not include soil physicochemical data, removing incomplete cases (n = 1122) left 1331 samples for our analyses (Table S7).A further five samples were also removed as they were characterised by outlying values of continuous variables.
We scaled and analysed the 43 continuous predictor variables using principal component analysis to reduce the dimensionality of the variables.PC1 and PC2 explained 27.6% and 14.4% of variance, respectively, and were selected for data visualisations (Figure S4).
We then used k-means clustering on scaled original data to assign the samples into clusters.The optimal number of clusters was examined using the 'elbow' method, silhouette method and gap statistic method.While four was considered an optimal number of clusters, we examined both four and five clusters.We found that the additional fifth cluster more distinctly separated land types.Thus, the five-cluster approach was selected for analysis.
The resulting cluster data were collated, and BPC 16S scores were then matched and returned to the dataset.Median values were calculated for each variable in each cluster, revealing ecological trends distinct to each cluster.Between-cluster significance was examined using Kruskal-Wallis tests.We gave each cluster a generalised description and plotted the sample geospatial coordinates into maps using 'ggmap' (Kahle & Wickham, 2013) and Google maps to visualise their geographical distributions.
We then used Random Forest regression modelling (Breiman, 2001) via the R package 'spatial RF' (Benito, 2021) to discern variable importance results and obtain partial dependence plots for each variable against BPC 16S scores.Only the 43 continuous variables were included in this analysis due to 'spatial RF' package limitations.The model fit was estimated using out-ofbag error from the bootstrap.To reduce multicollinearity, highly correlated predictor variables (r > .8 or VIF > 12) were removed (n = 9).Tuning the hyperparameters of the model (mtry = 24, num.trees = 500, min.nodesize = 5) improved its performance (R 2 ) by .006.Spatial autocorrelation of the residuals was then minimised while fitting the spatial regression model.The resulting Random Forest decision tree model explained 46.5% of the variance in our BPC 16S dataset.The variable importance plot was created using random permutations for each predictor variable's values in outof-bag data, then calculating the mean decrease in node impurity.importance.Partial dependence plots were then generated and confirmed the non-linear relationship of most variables with BPC 16S scores (Figure S5).

| Global distribution of butyrate-producing bacteria
Metagenomes with genes for butyrate production were found on every continent, in every ocean and in 89 countries (Figure 1a).

| Non-human animal hosts
Non-human animal host-associated samples included in our analysis showed that the human gut median BPC score (3.19, n = 800) was similar to the primate gut (3.12, n = 26).

| Plant hosts
Our dataset included 1006 plant-associated samples.

| Ecological characteristics associate with soil BPC scores
When clustering the Australian 16S rRNA soil samples on ecological metadata, each cluster aligned with distinct representative land types, which were given the following descriptive titles: 'arid inland clay plains' (cluster 1), 'seasonally productive sandy soils' (cluster 2), 'sandy inland deserts' (cluster 3), 'temperate urban hinterland' (cluster 4) and 'wet, cold, acidic, vegetated montane' (cluster 5).Mapping the geographical locations of the sample sites showed consistency with the clustered land type descriptions (Figure 3A).Median BPC 16S scores varied significantly between the clusters (Kruskal-Wallis test:  cold, acidic, vegetated montane' cluster (0.49, n = 208).Principal components analysis of all continuous ecological predictor variables showed two distinct patterns of axial distribution, ecological wetness and greenness and soil fertility, and dimensions 1 and 2 explained 27.6% and 14.4% of the variation in the data respectively (Figure 3C).
The cluster with the highest median BPC 16S score was the 'seasonally productive sandy soil', which has low clay content, low levels of cations, high precipitation seasonality and mean temperature, and generally moderate topographic relief and Prescott index (a measure of soil water balance based on average precipitation and potential evaporation).The cluster with the second-highest median BPC 16S score was 'temperate urban hinterland', which is generally moderate in elevation, annual rainfall, topographic relief, clay and soil fertility, and has high levels of zinc and manganese.The cluster with the lowest median BPC 16S score, 'wet, cold, acidic, vegetated montane', had high elevation and topographic relief, colder mean annual temperature, high annual rainfall and aridity index, consistent rainfall levels throughout the year, high soil organic carbon content, and high soil iron and aluminium content (where soil aluminium content and pH are inversely correlated).The two additional clusters, 'arid inland clay plains' and 'sandy inland desert', also had distinct characteristics (Table S8).A separate analysis of categorical variables also showed the highest BPC 16S scores among specific land cover types ('built-up' and 'plantation'), land use types (grazing of native pastures' and 'rural residential') and anthromes ('mixed settlements' and 'remote rangelands') (Table 2 and Table S9).
The variable importance plot from Random Forest decision tree analysis showed that geography most closely associates with the BPC 16S scores in soils (Figure 4a).The eight top predictor variables were longitude, elevation, iron, topographic relief, latitude, precipitation seasonality, population density and the Prescott index.Specifically, low-to-moderate elevation, higher soil iron, and moderate-to-high topographic relief showed a close relationship with higher BPC 16S scores (Figure 4b).Partial dependence plots for each variable are shown in Figure S5.

| DISCUSS ION
We report macroecological patterns of butyrate-producing bacteria via our novel BPC formulae that use both metagenomic and 16S rRNA amplicon bacterial data.Our BPC meta data showed that samples from anaerobic and fermentative conditions, such as the animal gut compartment and soil rhizosphere, had increased genomic potential for butyrate production.Our BPC 16S results showed that geographical and climatic variables, soil iron and human population density were key ecological covariates of soil butyrate production capacity.Soils from sandy, sparsely populated rangelands with high precipitation seasonality, as well as temperate peri-urban sites with greater soil TA B L E 2 Categorical variable subcategories with the highest butyrate production capacity (BPC 16S ) scores.a fertility, had higher BPC 16S scores.We discuss the novel and distinct macroecological patterns of butyrate production observed below.

| Anaerobic conditions associated with higher butyrate-production gene abundances
We show that samples from the chordate gut and anaerobic digesters had the highest BPC meta scores.These results are consistent with the knowledge that butyrate production occurs in anaerobic compartments, such as the human gut and anaerobic digesters (Conrad, 2020;Liu et al., 2016).Expectedly, air-exposed skin surface, animals without a digestive system (e.g.sponges) and activated sludge from aeration tanks had among the lowest BPC meta scores for their respective categories.However, our results also show that this anaerobic requirement is maintained within plant-soil systems that are more susceptible to complex ecological dynamics.Underground compartments, such as roots and rhizospheres, had significantly higher median BPC meta scores than air-exposed compartments such as leaf surfaces.Uteau et al. (2015) demonstrated within plant rhizospheres a spatial oxygen gradient based on the presence of oxygen-filled pores.They also found that episodic water-saturated conditions within the rhizosphere decreased the oxygen partial pressures substantially, which may help explain our finding that topographic relief, precipitation seasonality, and evapotranspiration rates play important roles in the butyrate production capacity within soils.We report an association between the population density of humans and soil BPC 16S scores.Soils from the 'built-up' land cover, 'mixed settlements' anthrome and 'temperate urban hinterlands' cluster showed among the highest BPC 16S scores.In contrast, soils from the 'remote woodlands' and 'wild woodlands' anthromes had among the lowest median BPC 16S scores.The anthropogenic influence on terrestrial ecosystems is well-known (Ellis et al., 2021), and our findings show this influence extends to microbial abundances, particularly butyrate-producing bacteria.
We show that 'temperate urban hinterlands' are a substantial reservoir of butyrate-producing bacteria.Geographical and climatic data show that these Australian sample sites are near major cities and are moderate in their elevation, mean temperature, topographic relief and annual rainfall.Thus, the rainfall tends to run off rather than accumulate, which creates the potential for fluctuating hydrology.Additionally, the increased BPC 16S scores from these peri-urban locations may be influenced by inadvertent dissemination and artificial soil 'contamination' with gut-associated bacteria from people and their pets (Blum et al., 2019), possibly contributing into a transmission cycle of butyrate producers.On the other hand, Australia's major cities and hinterlands are typically coastal, often with riverfloodplain systems and areas of fertile soils that were attractive to European settlers.Thus, our BPC 16S results suggest an association between high human population density and high BPC scores.
However, the direction of influence remains a compelling question that may be of future research interest.

| Hot, seasonally productive sandy soils
Our cluster 'seasonally productive sandy soils' had the highest BPC scores.These soils associated with high precipitation seasonality, which was the climatic variable with the strongest importance toward BPC scores.Hydrological fluctuations in soils have been shown to induce changes in both bacterial community structure and exometabolites (extracellular metabolites).For example, the wetting process in dry soil biocrusts can induce a shift from a cyanobacteria-dominated to a Firmicutes-dominated community within 49.5 h (Swenson et al., 2017).RoyChowdhury et al. (2022) also reported that soil abundances of Firmicutes increased in a dryto-wet transition, even more so than in fixed saturated conditions.
Because most butyrate producers are within the phylum Firmicutes, hydrological fluctuations may play a key role in butyrate production.
The 'seasonally productive sandy soils' cluster also had a high mean temperature, low soil fertility and low population density, which differs from the above patterns of higher BPC scores around temperate residential areas.Furthermore, these soils had high nonpersistent vegetation cover (Table S8).Seasonal rainfall could lead to flushes of green growth, followed by a suitable climate to enable microbial breakdown.Sandy soils would indicate less protection of organic matter associated with clay minerals, which could enhance microbial decomposition of existing organic matter (Johnston et al., 2009).This suggests a high turnover system, and indeed this cluster contains 67% of our samples characterised by land use as 'cropping lands' (Table S7).

| Soil iron associated with BPC
We observed that iron had the strongest association with BPC scores.Our partial dependency plot shows moderate soil iron levels associated with decreased BPC scores, but BPC scores then increased with higher iron levels.This result is consistent with findings from Dostal et al. (2015), where high iron levels enhanced butyrate production in Roseburia intestinalis cultures.Additionally, the authors' in vitro child gut fermentation model showed that a strong iron deficiency significantly decreased butyrate production.
It is worth noting that the butyrate production pathway requires several ferredoxins and ferredoxin-like proteins, which are proteins structured with iron, during the reduction of crotonyl-CoA into butyryl-CoA (Chowdhury et al., 2015).On the other hand, our results also show that 'wet, cold, acidic montane', the cluster with the lowest median BPC 16S score, also had a high concentration of iron. Lee et al. (2001) found that butyrate production from a sucrose solution began to decrease as iron concentration increased above 20 mg/L.Iron availability is also pH dependent, increasing in acidic conditions.Thus, the relationship between iron concentrations and butyrate production is not linear and may rely on a suite of conditions, possibly including the presence of wetting/drying cycles, that work in tandem to create the potential for butyrate production.

| Future research directions and limitations
During the development of our methods, several limitations of our study became apparent.
To maximise the precision of our butyrate producer database, we chose to use species-level classifications via GTDB representative species.However, this may have inadvertently created inconsistent data from species with multiple strains (sometimes hundreds of strains are present within a species), among which some may be butyrate producers and others not.Analysis at the strain level could provide a higher resolution of data, which should be a future research priority.In addition, our data were genomic only and did not include direct measurements of butyrate concentrations; this would require different data types (e.g.metabolomic data) and pipelines.
Future studies that butyrate concentrations in relation to butyrate-producing taxonomic and functional gene abundances could more precisely reveal conditions that promote active butyrate production beyond our estimations.Furthermore, because our 16S rRNA data came only from Australia, our modelling may not be generalisable to global conditions that exceed the ranges of our ecological covariate data.For example, the height of mountains on the Australian mainland does not exceed 2228 m; thus, our mountain cluster modelling may not fit other countries with higher mountains.
In our examination of soil shotgun metagenomic data using anthromes, the sampling methods were not disclosed and might not have been uniform.Metadata that includes sampling depth or indicates whether the soil is horizontally layered or horizontally homogenised would be necessary to draw more comprehensive conclusions.There is also a possibility that urban soils may also be influenced by either the use of compost, which may still bear bacterial spores or DNA (Subirats et al., 2022), or direct contamination by animal faeces.Thus, we have presented the soil BPC meta results but have refrained from drawing conclusions from them; future studies should include details of the sampling methods, which would reduce such biases.
Finally, our data depend on the capacity of laboratory DNA extraction methods to open endospores.Sampling methods that expose the samples to air may inadvertently cause the sporulation of bacteria.Such methods may subsequently reduce the quantities of DNA extracted from spore-formers, several of which could be butyrate-producing bacteria (Browne et al., 2016).Consistency across sampling and DNA extraction methods in future studies could help improve butyrate-producing bacterial abundance data reliability.

| CON CLUS IONS
Butyrate-producing bacteria provide critical ecosystem services for hosts and environments, including humans and soils.Our study focused on the distribution and ecological associations of these bacteria, building new knowledge of their roles in human-plantsoil ecosystems.While we show that geography and precipitation had strong effects on butyrate production capacity in outdoor soils, we also present new evidence that many outdoor ecosystems influenced by human-associated processes may be key reservoirs of butyrate producers.Because nearly 60% of the world's population now lives in urban areas (Güneralp et al., 2020), understanding the influence of dense human populations on outdoor microbiomes is essential via human impacts on soil-human health linkages (Delgado-Baquerizo et al., 2018;Kondo et al., 2018;Watkins et al., 2020).Our study helps advance these research areas.Future assessment of butyrate-production capacity across fine spatial scales (i.e.below global and continental scales, as used here) will help provide greater detail for city infrastructure planning and further microbiome-based public health and ecology research.
any non-continental Australia samples (n = 342).Covariate data were then collated from a variety of sources and reflected a range of soilforming factors (i.e.SCORPAN variables: S = soil; C = climate; O = organisms; R = relief; P = parent material; A = age; N = spatial location; McBratney et al., 2003) (see Table S6 for a list and description of all ecological SCORPAN variables).We identified 49 predictor variables (43 continuous, 6 categorical) as being relevant to our study, for which data sets were downloaded from the following sources: Australian Microbiome Initiative (e.g.organic carbon, clay content %, conductivity; Bissett et al., 2016), Atlas of Living Australia (e.g. annual temperature range, aridity index annual mean; Belbin, 2011; Williams et al., 2010), Soil and Landscape Grid of Australia (e.g.Prescott index, topographic wetness index; O'Brien, 2021) and FIGURE 1 Butyrateproducing bacteria are found on every continent, in every ocean, and in 89 countries.(a) Map showing study locations of samples with buk and/or atoA genes.(b) Density plots showing frequency distributions of sample Butyrate Production Capacity (BPC meta )scores in the six highest-level groupings (x-axis = BPC meta ).BPC meta score medians rather than means are presented due to non-normal BPC meta score distributions.The range of sample BPC meta scores was from 0.02 to 3.39.Bimodal peaks in five of the six categories may represent divergence between environments supportive and unsupportive of fermentative activity (discussed below).n is the number of samples.
(n = 771) were either direct or proxy (e.g.faecal) measures of animal gut microbiota (n = 448) or were non-gut but host-associated samples (e.g.attine ant fungus gardens, gutless marine worms, n = 323).We first compared non-human animal groupings by vertebrates (median BPC meta score = 3.11, n = 389) and invertebrates (median BPC meta score = 2.76, n = 382) (between-group differences were statistically significant: Wilcoxon rank sum test: W = 22,592, p < .001).We then compared non-human animal samples by taxonomic phylum (betweengroup differences were statistically significant: Kruskal-Wallis test:H = 331,4 df, p < .001; Figure 2b), where Chordata had the highest median BPC meta score (3.11, n = 389) and Porifera (sponges), which lack a gut, had the lowest BPC meta scores (1.87, n = 34).A further comparison F I G U R E 2 Butyrate Production Capacity (BPC meta ) scores vary between host communities and ecological sources.BPC meta score density plots by group subcategories.(a) BPC meta scores of humans, sorted by body compartment.(b) BPC meta scores of non-human animalassociated microbial communities, sorted by class.Note that Porifera do not possess a gut.(c) BPC meta scores of plant-associated samples, grouped into compartments.(d) BPC meta scores of soil samples, grouped into anthropogenic biomes (anthromes) levels that represent human influence on land use.The level 'Cultured' includes woodlands and inhabited drylands.(e) BPC meta scores of aquatic ecosystem samples, grouped into source site categories.(f) BPC meta scores of agricultural and industrial samples, grouped by source site.Activated sludge and anaerobic digesters are common components of wastewater treatment plants.In each of (a-f), Kruskal-Wallis tests show that betweengroup differences were significant at p < .001.Medians sharing a letter are not significantly different by the adjusted Dunn test at the 5% significance level.Boxes show the interquartile range.

Sequenced data type Shotgun metagenomic sequences 16S rRNA amplicon sequences
Thirty model repetitions were used to create the plot of variableSample BPC 16S score = (